{
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "## Convert any Image dataset to Lance\n",
        "\n",
        "This notebook demonstrates for transforming any Image Dataset into the Lance format. It provides a straightforward solution for converting diverse image datasets into a standardized Lance format.\n",
        "\n",
        "![image (1).png]()"
      ],
      "metadata": {
        "id": "KkwnfLYfEOHZ"
      }
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UByKm8Q6dCEB"
      },
      "source": [
        "### Imports"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "J8jICxVldCEC"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "import pandas as pd\n",
        "import pyarrow as pa\n",
        "import lance\n",
        "import time\n",
        "from tqdm import tqdm\n",
        "\n",
        "import warnings\n",
        "\n",
        "warnings.simplefilter(\"ignore\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AfUoIr4TdCEF"
      },
      "source": [
        "### Set the variable according to your Image dataset\n",
        "\n",
        "Assign the path to your image dataset to the variable `image_dataset`. This dataset should contain your images organized into training, testing, and validation folders. These images will be used to convert them into Lance format.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5JvY3ucxdCEF"
      },
      "outputs": [],
      "source": [
        "image_dataset = \"image_dataset\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4ULduqF2dCEG"
      },
      "source": [
        "### Processing the Images\n",
        "\n",
        "The `process_images` function is the central component of this notebook, responsible for transforming images from the training, testing, and validation folders into Lance format. This format typically includes essential attributes such as `image`, `filename`, `category`, and `data_type`.\n",
        "\n",
        "Specifically, `image` represents the actual image data, `filename` denotes the name of the file, `category` indicates the category to which the image belongs, and `data_type` specifies whether the image is from the training, testing, or validation set."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "S1wX3JVmdCEG"
      },
      "outputs": [],
      "source": [
        "def process_images():\n",
        "    # Get the current directory path\n",
        "    current_dir = os.getcwd()\n",
        "    images_folder = os.path.join(current_dir, image_dataset)\n",
        "    print(images_folder)\n",
        "\n",
        "    # Define schema for RecordBatch\n",
        "    schema = pa.schema(\n",
        "        [\n",
        "            (\"image\", pa.binary()),\n",
        "            (\"filename\", pa.string()),\n",
        "            (\"category\", pa.string()),\n",
        "            (\"data_type\", pa.string()),\n",
        "        ]\n",
        "    )\n",
        "\n",
        "    # Iterate over the data types (train, test, valid)\n",
        "    for data_type in [\"train\", \"test\", \"val\"]:\n",
        "        data_type_folder = os.path.join(images_folder, data_type)\n",
        "\n",
        "        # Iterate over the categories within each data type\n",
        "        for category in os.listdir(data_type_folder):\n",
        "            category_folder = os.path.join(data_type_folder, category)\n",
        "\n",
        "            # Iterate over the images within each category\n",
        "            for filename in tqdm(\n",
        "                os.listdir(category_folder), desc=f\"Processing {data_type} - {category}\"\n",
        "            ):\n",
        "                # Construct the full path to the image\n",
        "                image_path = os.path.join(category_folder, filename)\n",
        "\n",
        "                # Read and convert the image to a binary format\n",
        "                with open(image_path, \"rb\") as f:\n",
        "                    binary_data = f.read()\n",
        "\n",
        "                image_array = pa.array([binary_data], type=pa.binary())\n",
        "                filename_array = pa.array([filename], type=pa.string())\n",
        "                category_array = pa.array([category], type=pa.string())\n",
        "                data_type_array = pa.array([data_type], type=pa.string())\n",
        "\n",
        "                # Yield RecordBatch for each image\n",
        "                yield pa.RecordBatch.from_arrays(\n",
        "                    [image_array, filename_array, category_array, data_type_array],\n",
        "                    schema=schema,\n",
        "                )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Gcc4JvAYdCEI"
      },
      "source": [
        "### Creating a Lance Dataset\n",
        "\n",
        "This function, `write_to_lance`, is designed to convert a PyArrow Table into a Lance dataset. It begins by defining the schema for the Lance dataset, specifying fields such as `image`, `filename`, `category`, and `data_type` , make sure the schema is the same as the one defined in the `process_images` function.\n",
        "\n",
        "Once the schema is established, the function determines the path for saving the Lance file, leveraging the current working directory and the provided `image_dataset` variable. It then initializes a RecordBatchReader using the defined schema and the data obtained from the `process_images` function."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "IhfZI177dCEJ"
      },
      "outputs": [],
      "source": [
        "# Function to write PyArrow Table to Lance dataset\n",
        "def write_to_lance():\n",
        "    # Create an empty RecordBatchIterator\n",
        "    schema = pa.schema(\n",
        "        [\n",
        "            pa.field(\"image\", pa.binary()),\n",
        "            pa.field(\"filename\", pa.string()),\n",
        "            pa.field(\"category\", pa.string()),\n",
        "            pa.field(\"data_type\", pa.string()),\n",
        "        ]\n",
        "    )\n",
        "\n",
        "    # Specify the path where you want to save the Lance file\n",
        "    current_dir = os.getcwd()\n",
        "    images_folder = os.path.join(current_dir, image_dataset)\n",
        "    lance_file_path = os.path.join(images_folder, f\"{image_dataset}.lance\")\n",
        "\n",
        "    reader = pa.RecordBatchReader.from_batches(schema, process_images())\n",
        "    lance.write_dataset(\n",
        "        reader,\n",
        "        lance_file_path,\n",
        "        schema,\n",
        "    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tW9GeYJddCEL"
      },
      "source": [
        "### Load a Lance Dataset and Visualize it in Pandas Dataframe"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UQHGcmzPdCEO"
      },
      "source": [
        "`loading_into_pandas` function is designed to load a Lance dataset into a Pandas dataframe. It let's you see your Lance dataset in a pandas dataframe.\n",
        "\n",
        "The function takes the path to the Lance file as an argument and returns a pandas dataframe. Make sure the schema is the same as the one defined during the Lance dataset generation, refer to `process_images` function and also make sure the path to the Lance file is correct."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4wwa4FQmdCEP"
      },
      "outputs": [],
      "source": [
        "def loading_into_pandas():\n",
        "    # Load Lance file from the same folder\n",
        "    current_dir = os.getcwd()\n",
        "    images_folder = os.path.join(current_dir, image_dataset)\n",
        "    uri = os.path.join(images_folder, \"image_dataset.lance\")\n",
        "\n",
        "    ds = lance.dataset(uri)\n",
        "\n",
        "    # Accumulate data from batches into a list\n",
        "    data = []\n",
        "    for batch in tqdm(\n",
        "        ds.to_batches(\n",
        "            columns=[\"image\", \"filename\", \"category\", \"data_type\"], batch_size=10\n",
        "        ),\n",
        "        desc=\"Loading batches\",\n",
        "    ):\n",
        "        tbl = batch.to_pandas()\n",
        "        data.append(tbl)\n",
        "\n",
        "    # Concatenate all DataFrames into a single DataFrame\n",
        "    df = pd.concat(data, ignore_index=True)\n",
        "    print(\"Pandas DataFrame is ready\")\n",
        "    print(\"Total Rows: \", df.shape[0])"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "env",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.12.2"
    },
    "colab": {
      "provenance": []
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}